MacOS X Server Release Notes Copyright © 1998 by Apple Computer, Inc. All Rights Reserved.
These release notes describe aspects of the Apple Information Access Toolkit (AIAT). AIAT is a "search engine" that implements indexing and retrieval of documents, as well as less familiar operations such as the automatic routing of similar documents and the automatic generation of document summaries. Because it is strictly an engine, AIAT has no user interface. It does support a particular style of user interaction, described in the Apple Information Access Toolkit Programmer's Guide and in these release notes.
This document assumes that you have read the "Apple Information Access Toolkit Programmer's Guide" and are familiar with the various components of the IAT and the terminology used in that document. The Additions to the Apple Information Access Toolkit document provides updated reference material for that guide.
The main IAT functionality is designed to be platform-independent. However, the storage and corpus subsystems of the IAT framework still need platform-specific subclasses. These subclasses provide the functionality required to manage storage and access documents.
IAT provides classes that facilitate the storage of blocks of data into persistent storage. This storage is used by IAT to hold the information access indexes and structures. Indexes require persistent storage; this set of logical storage classes provides an interface to the storage media desired to hold the index information. Developers may also use these storage classes to store other data they wish to make persistent.
The IAT storage architecture is designed to be
platform-independent, but you should use platform-specific subclass
implementations to optimize performance. IAT provides a
MacOS-specific implementation of storage that uses the Macintosh HFS
file system. IAT also provides a Rhapsody-specific implementation of
storage that uses the Rhapsody file system.
MakeFileStorage is the utility that constructs storage for a Rhapsody file. You must know the full path name of the storage file before you can construct a Rhapsody storage.
Following creation, initialize the storage for use. This initialization creates the structures used to address blocks and opens the storage for writing.
    #include "UFSStorage.h"

    // Client must provide the storage file name:
    StringPtr storageFileName = "storage.file"; // defaults to the current directory

    // create storage
    IAStorage* anIAStorage = MakeFileStorage(storageFileName);
    anIAStorage->Initialize();
Opening an existing storage requires a storage object; opening restores data from persistent storage to the object. Storage may be opened for read-only or for read-write access. Open(true) allows writes.
    #include "UFSStorage.h"

    // Client must provide the storage file name:
    StringPtr storageFileName = "storage.file"; // defaults to the current directory
    bool writable = true;

    // create storage
    IAStorage* anIAStorage = MakeFileStorage(storageFileName);
    anIAStorage->Open(writable);
The base unit of storage is a block. A block is a contiguous set of data that is written or read from storage as a whole. Individual bytes, words, or strings are accessed in the block once it is in memory. A block has a block ID that uniquely identifies it. This ID is of type IABlockID.
The storage object maintains a table of allocated blocks that maps each block to a specific location in physical storage. Objects using storage must know which block contains their desired data. They can do this by maintaining their own table of contents of storage, or they can request a named block in the internal storage table of contents and keep track of that block name rather than its ID. In this case, the storage maintains an internal table, known as the TOC (for "Table Of Contents"), which maps the block names to block IDs.
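The relationship between block names, block IDs, and physical locations can be pictured with a small in-memory model. The sketch below is not IAT code: ToyStorage, ToyBlockID, OffsetOf(), and the size parameters are hypothetical stand-ins. Only the two-level lookup (name to ID through the TOC, ID to location through the block table) is taken from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for IABlockID; the real type is defined by IAT.
typedef std::uint32_t ToyBlockID;

// Toy model of a storage object: a block table plus a name-to-ID TOC.
class ToyStorage {
public:
    ToyStorage() : fNextID(1), fNextOffset(0) {}

    // Allocate an unnamed block; the caller must remember the returned ID.
    ToyBlockID Allocate(std::uint32_t size) {
        ToyBlockID id = fNextID++;
        fBlockTable[id] = fNextOffset;   // block ID -> location in storage
        fNextOffset += size;
        return id;
    }

    // Allocate a block and record its name in the TOC.
    ToyBlockID AllocateNamedBlock(const std::string& name, std::uint32_t size) {
        if (fTOC.count(name) != 0)
            throw std::runtime_error("block name already in use: " + name);
        ToyBlockID id = Allocate(size);
        fTOC[name] = id;                 // block name -> block ID
        return id;
    }

    // Look up a block ID by name; throws if the name is unknown.
    ToyBlockID TOC_Get(const std::string& name) const {
        auto it = fTOC.find(name);
        if (it == fTOC.end())
            throw std::runtime_error("no such named block: " + name);
        return it->second;
    }

    // Look up the storage offset of a block by ID.
    std::uint32_t OffsetOf(ToyBlockID id) const {
        auto it = fBlockTable.find(id);
        if (it == fBlockTable.end())
            throw std::runtime_error("unallocated block ID");
        return it->second;
    }

private:
    ToyBlockID fNextID;
    std::uint32_t fNextOffset;
    std::map<ToyBlockID, std::uint32_t> fBlockTable;
    std::map<std::string, ToyBlockID> fTOC;
};
```

A client that uses named blocks only has to remember the string; a client that calls Allocate() directly has to persist the returned ID somewhere itself.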
The following example allocates a new named block in a UFSStorage. When a block of storage is first created, it is always an output block, which allows data to be written to the block.
    // fileName and anIABlockSize are supplied by the client

    // create storage
    IAStorage* anStorage = MakeFileStorage(fileName);
    anStorage->Initialize();

    const char* aBlockName = "MY NAMED BLOCK";

    // ask for a new block to be labeled with the given name
    IABlockID anIABlockID = anStorage->AllocateNamedBlock(aBlockName);
    IAOutputBlock anIAOutputBlock(anStorage, anIABlockID, anIABlockSize);
The example below opens existing storage and reestablishes a previously allocated named block for reading.
    // create storage object
    bool writable = true;
    IAStorage* anStorage = MakeFileStorage(fileName);
    anStorage->Open(writable);

    // get the pre-defined block ID
    const char* aBlockName = "MY NAMED BLOCK";
    IABlockID anIABlockID = anStorage->TOC_Get(aBlockName);
    IAInputBlock anIAInputBlock(anStorage, anIABlockID);
You can allocate storage directly, without using a named block, by calling the Allocate() function. This function returns a block ID, which the application must keep track of.
Delete storage by deallocating a block using the Deallocate(anIABlockID) function for unnamed blocks, or the RemoveNamedBlock(blockName) function for named blocks.
WARNING
If you use Deallocate() to delete a named block (instead of RemoveNamedBlock()), you will leave the TOC entry for that name untouched. Unless you do a matching TOC_Remove(), you will render that name unusable for the remaining life of the index.
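The failure mode in the warning can be illustrated with a toy model. This is a sketch under a stated assumption: it supposes that a stale TOC entry simply blocks reuse of the name, which may not be exactly how IAT behaves. ToyNamedStorage, IsLive(), and HasName() are hypothetical; only the division of labor between Deallocate(), TOC_Remove(), and RemoveNamedBlock() follows the text.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical toy model; none of these declarations come from IAT.
class ToyNamedStorage {
public:
    ToyNamedStorage() : fNextID(1) {}

    // Returns 0 (an invalid ID in this toy) if the name is already in the TOC.
    std::uint32_t AllocateNamedBlock(const std::string& name) {
        if (fTOC.count(name) != 0) return 0;
        std::uint32_t id = fNextID++;
        fLive.insert(id);
        fTOC[name] = id;
        return id;
    }

    // Frees the block only; the TOC is not consulted.
    void Deallocate(std::uint32_t id) { fLive.erase(id); }

    // Removes the TOC entry only.
    void TOC_Remove(const std::string& name) { fTOC.erase(name); }

    // Frees the block and removes its TOC entry together.
    void RemoveNamedBlock(const std::string& name) {
        auto it = fTOC.find(name);
        if (it == fTOC.end()) return;
        fLive.erase(it->second);
        fTOC.erase(it);
    }

    bool IsLive(std::uint32_t id) const { return fLive.count(id) != 0; }
    bool HasName(const std::string& name) const { return fTOC.count(name) != 0; }

private:
    std::uint32_t fNextID;
    std::set<std::uint32_t> fLive;               // currently allocated block IDs
    std::map<std::string, std::uint32_t> fTOC;   // block name -> block ID
};
```

In this model, calling Deallocate() on a named block leaves HasName() true for a block that no longer exists, so a later AllocateNamedBlock() with the same name fails until TOC_Remove() is called.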
You may need to create a storage subclass if your persistent storage needs to be based on something other than the Macintosh HFS file system.
The IAStorage, IAInputBlock, and IAOutputBlock classes do not require a specialized subclass. But you need to subclass IAStoreStream, and you need to create a new utility to construct your storage.
Creating a Storage Construction Utility. You create storage by creating a store stream and then an IAStorage object. To construct storage, you must invoke the default construction utility IAMakeStorage(IAStoreStream*). By supplying your file type's store stream, you effectively create your file type's storage subclass. The following code shows a storage construction utility built to create Rhapsody storage.
    #include "UFSStorage.h"
    #include "UFSStoreStream.h"

    IAStorage* MakeFileStorage(const char* pathname)
    {
        return IAMakeStorage(new UFSStoreStream(pathname));
    }
Creating a Subclass of IAStoreStream. Because IAStoreStream is an abstract base class, it requires a subclass to do the actual storage of input and output; a subclass must support the actual storage I/O for a specific platform. See the documentation of the IAStoreStream class for detailed information. The class declaration below shows the Rhapsody implementation of IAStoreStream and its functions as an example.
    class UFSStoreStream : public IAStoreStream {
    public:
        UFSStoreStream(const char* name);
        ~UFSStoreStream();

        void Initialize();
        void Open(bool writable);
        bool IsOpen();
        bool IsWritable();
        bool IsClone();
        void Flush();
        uint32 GetEOF();
        void SetEOF(uint32 address);
        virtual IAStoreStream* Clone();
        char* GetFileName() const {return fFileName;}

    protected:
        UFSStoreStream(const char* fileName, FILE* fd, bool isOpen, bool isWritable); // clone will use
        void Write(uint32 address, const byte* data, uint32 length);
        uint32 Read(uint32 address, byte* data, uint32 length);

    private:
        //...
    };
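To make the shape of such a subclass concrete, here is a minimal sketch of a store stream that keeps its "file" in memory. It does not use the real IAStoreStream: ToyStoreStream is a hypothetical, cut-down base class that keeps only the address-based Read(), Write(), GetEOF(), and SetEOF() members, and it makes Read() and Write() public for ease of demonstration, whereas the declaration above keeps them protected. std::uint8_t and std::uint32_t stand in for IAT's byte and uint32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for IAStoreStream: only the raw I/O surface.
class ToyStoreStream {
public:
    virtual ~ToyStoreStream() {}
    virtual std::uint32_t GetEOF() = 0;
    virtual void SetEOF(std::uint32_t address) = 0;
    virtual void Write(std::uint32_t address, const std::uint8_t* data,
                       std::uint32_t length) = 0;
    virtual std::uint32_t Read(std::uint32_t address, std::uint8_t* data,
                               std::uint32_t length) = 0;
};

// A platform-specific subclass; here the "platform" is a byte vector.
class MemoryStoreStream : public ToyStoreStream {
public:
    std::uint32_t GetEOF() { return static_cast<std::uint32_t>(fBytes.size()); }

    // Truncates or zero-extends the stream to the given length.
    void SetEOF(std::uint32_t address) { fBytes.resize(address, 0); }

    // Writes length bytes at address, growing the stream if necessary.
    void Write(std::uint32_t address, const std::uint8_t* data,
               std::uint32_t length) {
        std::size_t end = static_cast<std::size_t>(address) + length;
        if (end > fBytes.size()) fBytes.resize(end, 0);
        if (length > 0) std::memcpy(&fBytes[address], data, length);
    }

    // Reads up to length bytes at address; returns the number actually read.
    std::uint32_t Read(std::uint32_t address, std::uint8_t* data,
                       std::uint32_t length) {
        if (address >= fBytes.size()) return 0;
        std::uint32_t available =
            static_cast<std::uint32_t>(fBytes.size()) - address;
        std::uint32_t count = length < available ? length : available;
        if (count > 0) std::memcpy(data, &fBytes[address], count);
        return count;
    }

private:
    std::vector<std::uint8_t> fBytes;
};
```

A real subclass would additionally implement Initialize(), Open(), Flush(), and Clone() against the platform's file API; those are omitted here because their exact contracts are described in the IAStoreStream documentation, not in these notes.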
In the field of information retrieval a "corpus" is a collection of documents that is being searched. In IAT a corpus class provides the tools for identifying a set of documents as a collection and providing text from these documents so they can be indexed. The corpus is the interface between documents and the index. The corpus locates the document files and provides buffered text from these documents to the index and analysis objects. The corpus maintains the location of the collection of documents and, optionally, provides an iterator through them.
IAT provides an implementation that supports Rhapsody files and interfaces to the collection of files within a Rhapsody directory. There are two implementations of the corpus abstract classes. UFSCorpus provides access to the content in a Rhapsody file. UFSDirectoryCorpus provides, in addition, the ability to iterate through a directory and its subdirectories and select documents.
Use the corpus document iterator to provide all documents currently in the corpus, whether or not they are indexed. The following code example illustrates how to list all documents in a Rhapsody directory.
    // build the corpus
    UFSDirectoryCorpus aDirectoryCorpus(directoryPathName);

    // get an iterator through the corpus
    IADocIterator* anIADocIterator = aDirectoryCorpus.GetDocIterator();
    UFSDirectoryDoc* directoryDoc;

    // GetNextDoc() returns NULL when no more text docs in folder
    while ((directoryDoc = (UFSDirectoryDoc*)anIADocIterator->GetNextDoc()) != NULL) {
        printf("");
        PrintDocName(directoryDoc);
        printf("");
    }
A corpus is stored through its index. Generally a corpus is created at the same time an index is created. See the documentation on creating an index.
    // choose a corpus implementation
    #include "UFSDirectoryCorpus.h"
    // choose an analysis implementation
    #include "SimpleAnalysis.h"
    // choose an index implementation
    #include "InVecIndex.h"

    // get the user information (using constants for the sake of this example)
    char* name = "recipes.index";
    char* UFSDirectoryName = "/myroot/Corpora/recipes";

    // ... do your storage stuff here

    // create index for directory (creates corpus and analysis)
    InVecIndex anInVecIndex(aStorage, new UFSDirectoryCorpus(UFSDirectoryName), new SimpleAnalysis());
The corpus is stored through its index. To establish an existing corpus, you must first establish its index and then address the corpus data member. The corpus is stored in the index as an IACorpus.
    // establish the existing index containing the corpus
    // get the user information (using constants for the sake of this example)
    char* storageName = "recipes.index";
    char* UFSDirectoryName = "/myroot/Corpora/recipes";
    bool writable = true;

    // reestablish storage for the index
    IAStorage* aStorage = MakeFileStorage(storageName);
    IADeleteOnUnwind delInxStorage(aStorage);
    aStorage->Open(writable);

    // reestablish index for folder (reestablishes corpus and analysis)
    InVecIndex anInVecIndex(aStorage, new UFSDirectoryCorpus(UFSDirectoryName), new SimpleAnalysis());
    anInVecIndex.Open();
If you need to create a corpus subclass, you generally need to create several subclasses: one each of IACorpus, IADoc, and IADocText, as the Rhapsody classes described below illustrate.
You may also need to provide a subclass of IADocIterator if you wish to provide an index Update() function.
The UFSCorpus class characterizes a set of documents. It contains information identifying the directory parentage of the documents. You can use a UFSCorpus object to extract text from a Rhapsody file. This class is defined in the header file UFSCorpus.h.
    class UFSCorpus : public IACorpus {
    public:
        UFSCorpus() : IACorpus(UFSCorpusType) {}
        virtual ~UFSCorpus() {};

        // IACorpus methods
        IADoc* GetProtoDoc();
        IADocText* GetDocText(const IADoc* doc);

    private:
        // ...
    };
UFSDoc contains the information needed to locate a document: its full path name. IADoc is the abstract class for the interface to the physical document. Any implementation must contain the data required to locate the actual document. An implementation of an IADoc subclass requires a matching implementation of an IADocText subclass. The UFSDoc class is defined in the header file UFSCorpus.h.
    class UFSDoc : public IADoc {
    public:
        UFSDoc() : fPathName(NULL) {}
        UFSDoc(const byte* p, bool makeCopy = true);
        virtual ~UFSDoc();

        IAStorable* DeepCopy() const;
        uint32 StoreSize() const;
        void Store(IAOutputBlock* output) const;
        IAStorable* Restore(IAInputBlock* input) const;
        bool LessThan(const IAOrderedStorable* neighbor) const;
        bool Equal(const IAOrderedStorable* neighbor) const;
        byte* GetName(uint32 *length) const;
        void SetName(const byte* fullpath);
        byte* GetPath() const {return fPathName;}

    protected:
        void DeepCopying(const IAStorable* source);
        void Restoring(IAInputBlock* input, const IAStorable* proto);

    private:
        // ...
    };
IADocText provides the text from the actual document. An implementation of this must be able to locate the document and read its contents. This class is defined in the header file UFSCorpus.h.
    class UFSDocText : public IADocText {
    public:
        UFSDocText() : fStream(NULL) {}
        UFSDocText(const byte* path);
        ~UFSDocText();

        uint32 GetNextBuffer(byte* buffer, uint32 bufferLength);
        IADocText* DeepCopy() const;

    private:
        // ...
    };
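The buffered-read pattern implied by GetNextBuffer() can be sketched as follows. This is an illustration only: ToyDocText and ReadAll() are hypothetical, the text is held in memory and not read from a file, and the convention that a return value of 0 marks the end of the document is an assumption that these notes do not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for an IADocText subclass.
class ToyDocText {
public:
    explicit ToyDocText(const std::string& text) : fText(text), fPos(0) {}

    // Copies up to bufferLength bytes into buffer and returns the number
    // copied; 0 signals that the document is exhausted (assumed convention).
    std::uint32_t GetNextBuffer(std::uint8_t* buffer, std::uint32_t bufferLength) {
        std::size_t remaining = fText.size() - fPos;
        std::size_t count = remaining < bufferLength ? remaining : bufferLength;
        if (count > 0) std::memcpy(buffer, fText.data() + fPos, count);
        fPos += count;
        return static_cast<std::uint32_t>(count);
    }

private:
    std::string fText;
    std::size_t fPos;   // offset of the next unread byte
};

// Drains a document through a small fixed buffer, as an indexer might.
std::string ReadAll(ToyDocText& doc) {
    std::uint8_t buffer[4];
    std::string result;
    std::uint32_t n;
    while ((n = doc.GetNextBuffer(buffer, sizeof buffer)) > 0)
        result.append(reinterpret_cast<const char*>(buffer), n);
    return result;
}
```

The caller never sees the whole document at once, which is what lets the index and analysis objects work with bounded memory regardless of document size.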
The IADocIterator locates the documents in the corpus in sequence. Here is a sample header file for an IADocIterator subclass:
    class UFSDirectoryCorpusIterator : public IADocIterator {
    public:
        UFSDirectoryCorpusIterator(UFSDirectoryCorpus* c)
            : corpus(c), ufsIterator(new UFSIterator(c->GetFullPath())) {}
        ~UFSDirectoryCorpusIterator() {delete ufsIterator; }

        IADoc* GetNextDoc();

    private:
        UFSDirectoryCorpus* corpus;
        UFSIterator* ufsIterator;
    };
And here is a sample implementation of GetNextDoc():
    IADoc* UFSDirectoryCorpusIterator::GetNextDoc()
    {
        while (ufsIterator->Increment()) {
            struct stat* info = ufsIterator->GetFileInfo();
            return new UFSDirectoryDoc(corpus, ufsIterator->GetPath(),
                                       ufsIterator->GetFileName(), info->st_mtime);
        }
        return NULL;
    }
UFSIterator
UFSIterator returns any file from a given directory. It recurses through all folders to get to the actual files. The Increment() member function returns true if a file has been found and false if there are no more files within the directories. You can obtain file information (stat) with the GetFileInfo() function, and the directory path and file name with the GetPath() and GetFileName() functions. This class is defined in the header file UFSIterator.h.
    class UFSIterator : public IAObject {
    public:
        UFSIterator(const byte* pathname);
        ~UFSIterator();

        bool Increment();
        struct stat* GetFileInfo() const {return fStat;}
        byte* GetPath() const;
        byte* GetFileName() const;

    protected:
        // Accessors needed to override Increment()
        DirectoryInfo* GetDirInfos() const {return fDirInfos;}
        long GetDirCount() const {return fDirCount;}
        uint32 GetDir() const {return fDir;}
        void CollectDirInfo(const byte* name);

    private:
        // ...
    };
You can filter files inside the iteration loop, for example in a variation of GetNextDoc():

    // today and ValidType() are client-defined
    while (ufsIterator->Increment()) {
        struct stat* info = ufsIterator->GetFileInfo();
        byte* name = ufsIterator->GetFileName();

        // filter out non-valid and old documents
        // the definition for ValidType(), is up to you!
        if (info->st_mtime == today && ValidType(name)) {
            return new UFSDirectoryDoc(corpus, ufsIterator->GetPath(),
                                       ufsIterator->GetFileName(), info->st_mtime);
        }
    }
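The recursive traversal that UFSIterator performs can be sketched without touching the file system. In the toy below, ToyDir and ToyIterator are hypothetical, the directory tree lives in memory, and the depth-first order (files of a directory before its subdirectories) is an assumption; only the Increment(), GetPath(), and GetFileName() calling pattern mirrors the class above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory directory: files plus nested subdirectories.
struct ToyDir {
    std::string name;
    std::vector<std::string> files;
    std::vector<ToyDir> subdirs;
};

// Walks a ToyDir tree depth-first, one file per Increment() call.
class ToyIterator {
public:
    explicit ToyIterator(const ToyDir& root) {
        Frame f = {&root, root.name, 0, 0};
        fStack.push_back(f);
    }

    // Advances to the next file; returns false when the tree is exhausted.
    bool Increment() {
        while (!fStack.empty()) {
            Frame& top = fStack.back();
            if (top.nextFile < top.dir->files.size()) {
                fPath = top.path;
                fFileName = top.dir->files[top.nextFile++];
                return true;
            }
            if (top.nextDir < top.dir->subdirs.size()) {
                const ToyDir& child = top.dir->subdirs[top.nextDir++];
                Frame f = {&child, top.path + "/" + child.name, 0, 0};
                fStack.push_back(f);   // descend; 'top' is not used after this
                continue;
            }
            fStack.pop_back();         // directory finished; back to the parent
        }
        return false;
    }

    const std::string& GetPath() const { return fPath; }
    const std::string& GetFileName() const { return fFileName; }

private:
    // One entry per directory currently being visited.
    struct Frame {
        const ToyDir* dir;
        std::string path;
        std::size_t nextFile;
        std::size_t nextDir;
    };
    std::vector<Frame> fStack;
    std::string fPath;
    std::string fFileName;
};
```

Keeping an explicit stack, as opposed to recursing, is what allows the walk to pause after each file and resume on the next Increment() call.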
UFSDirectoryCorpus
The UFSDirectoryCorpus is a subclass of the IACorpus class. It maintains an iterator that, given a directory, returns documents within that directory and, recursively, in subdirectories. It chooses only documents that satisfy the client-defined criteria. Because UFSDirectoryDoc contains a modification date, only those selected documents (files) modified since the last update are submitted for re-analysis.
The client registers a function (using the SetCriteriaFunction() member function) that returns true if the current document satisfies the client-defined criteria (for example, if it is the right type) and false otherwise. By default all documents are selected. This class is defined in the header file UFSDirectoryCorpus.h.
    typedef bool DocumentTypeFn(const char* fileName); // criteria function type

    class UFSDirectoryCorpus : public IACorpus {
    public:
        UFSDirectoryCorpus(uint32 type = UFSDirectoryCorpusType);
        UFSDirectoryCorpus(const byte* rootDirPath, uint32 type = UFSDirectoryCorpusType);
        virtual ~UFSDirectoryCorpus();

        // IACorpus methods
        IADoc* GetProtoDoc();                     // this will return UFSDirectoryDoc
        IADocText* GetDocText(const IADoc* doc);  // this will return an UFSDocText
        IADocIterator* GetDocIterator();

        // UFSDirectoryCorpus specific methods
        uint32 GetDirectoryID(const byte* fullPath); // allocate id for path.
        // returns an IAArray of path
        byte* GetDirectory(uint32 directoryID, uint32 *length);
        byte* GetFullPath() const {return fRootDirectory;}
        void SetCriteriaFunction(DocumentTypeFn* func);
        DocumentTypeFn* GetCriteriaFunction() const;

    protected:
        IABlockSize InitialSize();
        void Initializing(IAOutputBlock* output);
        void Opening(IAInputBlock* input);
        IABlockSize UpdateSize();
        void Updating(IAOutputBlock* output);
        UFSDirectoryInfo** GetDirectoryInfos () const {return fDirectoryInfos;}
        uint32 GetDirectoryCount() const {return fDirectoryCount;}
        void DeleteDirectoryInfos();

    private:
        // ...
    };
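The two selection rules described for UFSDirectoryCorpus, a client-supplied criteria function and a modification-date check, can be combined as in the sketch below. The DocumentTypeFn typedef is taken from the declaration above; everything else (ToyFile, IsTextFile(), SelectForReanalysis(), and the choice of a strict "modified after the last update" comparison) is hypothetical and only illustrates the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Criteria function type, as declared for UFSDirectoryCorpus.
typedef bool DocumentTypeFn(const char* fileName);

// Hypothetical record for one file; the real corpus uses UFSDirectoryDoc.
struct ToyFile {
    std::string name;
    long modDate;   // seconds since the epoch, like st_mtime
};

// Example criteria function: accept only names ending in ".txt".
bool IsTextFile(const char* fileName) {
    std::size_t n = std::strlen(fileName);
    return n >= 4 && std::strcmp(fileName + n - 4, ".txt") == 0;
}

// Returns the names that pass the criteria function (all names if it is
// NULL) and that were modified after lastUpdate.
std::vector<std::string> SelectForReanalysis(const std::vector<ToyFile>& files,
                                             DocumentTypeFn* criteria,
                                             long lastUpdate) {
    std::vector<std::string> selected;
    for (std::size_t i = 0; i < files.size(); ++i) {
        const ToyFile& f = files[i];
        bool accepted = (criteria == NULL) || criteria(f.name.c_str());
        if (accepted && f.modDate > lastUpdate)
            selected.push_back(f.name);
    }
    return selected;
}
```

Passing NULL for the criteria function models the default behavior, in which all documents are selected and only the date check applies.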
UFSDirectoryDoc
Recall that IADoc is the abstract class for the interface to the physical document. Any implementation of a concrete subclass must provide the data required to locate the actual document. Any implementation of an IADoc subclass normally requires a matching implementation of an IADocText subclass. However, the GetDocText() member function of UFSDirectoryCorpus returns a UFSDocText, so there is no separate UFSDirectoryDocText class. The UFSDirectoryDoc class is defined in UFSDirectoryCorpus.h.
    class UFSDirectoryDoc : public IADoc {
    public:
        UFSDirectoryDoc(UFSDirectoryCorpus* corpus, const byte* path,
                        const byte* file, long date);
        UFSDirectoryDoc();
        virtual ~UFSDirectoryDoc();

        IAStorable* DeepCopy() const;
        uint32 StoreSize() const;
        void Store(IAOutputBlock* output) const;
        IAStorable* Restore(IAInputBlock* input) const;
        bool LessThan(const IAOrderedStorable* neighbor) const;
        bool Equal(const IAOrderedStorable* neighbor) const;
        byte* GetName(uint32 *length) const;
        uint32 GetDirectoryID() const;
        void SetModDate (long mDate) {fModDate = mDate;}
        uint32 GetModDate () const {return fModDate;}
        uint32 GetDirID() const {return fDirID;}
        byte* GetFileName() const {return fName;}

    protected:
        void DeepCopying(const IAStorable* source);
        void Restoring(IAInputBlock* input, const IAStorable* proto);

    private:
        // ...
    };
UFSDirectoryInfo
An object of the UFSDirectoryInfo class helps to reduce the size of corpus-related information in the index. It maps a directory ID (IAT generated) to parentage path names and the creation date. This mapping eliminates the need for storing full path names for each document in the same directory in the index. UFSDirectoryCorpus uses a collection of UFSDirectoryInfo instances for looking up directory IDs. This class is defined in UFSDirectoryCorpus.h.
    class UFSDirectoryInfo : public IAStorable {
    public:
        UFSDirectoryInfo();
        UFSDirectoryInfo(const byte* pathname);
        ~UFSDirectoryInfo();

        // methods to store a UFSDirectoryInfo
        IABlockSize StoreSize() const;
        void Store(IAOutputBlock* output) const;
        IAStorable* Restore(IAInputBlock* input) const;
        IAStorable* DeepCopy() const;

        byte* GetDirectoryName() const {return fPath;}
        uint32 GetCreationDate() const {return fCreationDate;}
        void SetDirectoryName(const byte* pathname) {fPath = (byte*)pathname;}
        void SetCreationDate(uint32 cDate) {fCreationDate = cDate;}

    private:
        // ....
    };
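The space saving that UFSDirectoryInfo provides comes from storing each directory path once and letting documents refer to it by ID. The sketch below shows that idea in isolation. ToyDirectoryTable, ToyDoc, and FullPath() are hypothetical names, the IDs are simply positions in a vector, and creation dates are left out; the real mapping is maintained by UFSDirectoryCorpus.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical directory table: each distinct path is stored once and
// documents refer to it by a small integer ID.
class ToyDirectoryTable {
public:
    // Returns the ID for path, allocating a new one on first use.
    std::uint32_t GetDirectoryID(const std::string& path) {
        auto it = fIDs.find(path);
        if (it != fIDs.end()) return it->second;
        std::uint32_t id = static_cast<std::uint32_t>(fPaths.size());
        fPaths.push_back(path);
        fIDs[path] = id;
        return id;
    }

    // Returns the path for a previously allocated ID (throws if unknown).
    const std::string& GetDirectory(std::uint32_t id) const { return fPaths.at(id); }

    std::size_t Count() const { return fPaths.size(); }

private:
    std::vector<std::string> fPaths;             // ID -> path
    std::map<std::string, std::uint32_t> fIDs;   // path -> ID
};

// A document then needs only a directory ID and a file name.
struct ToyDoc {
    std::uint32_t dirID;
    std::string fileName;
};

// Rebuilds the full path of a document from the shared table.
std::string FullPath(const ToyDirectoryTable& dirs, const ToyDoc& doc) {
    return dirs.GetDirectory(doc.dirID) + "/" + doc.fileName;
}
```

With many documents per directory, each stored document carries a four-byte ID in place of a repeated path string, which is the reduction in index size that the paragraph above describes.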
UFSDirectoryNotFound - Specified directory is not found
UFSError - File System Error
A category for related documents is called a cluster. Clusters are represented by the IACluster class, which must be subclassed to handle particular document types. For example, the IAT provides the subclass UFSCluster, which represents a cluster of Rhapsody documents.
    class UFSCluster: public IACluster {
    public:
        UFSCluster (IAIndex* index, const byte* path);
        virtual ~UFSCluster();

        IADoc* GetNextDoc() const ; // returns the next document in the cluster.
        void Reset();               // reset to the first document
    };